feat: make ICU the default FTS tokenizer#6968
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
westonpace
left a comment
There was a problem hiding this comment.
Seems like good rationale. The only thing that makes me slightly hesitant is having default_base_tokenizer depend on a feature flag since it might be confusing.
Still, it's in the default feature list, and it's always going to be on for wheels / pylance, so I guess the vast majority of users will never be clearing this flag.
I'm open to just remove this feature and make it always enabled. |
This changes the default native FTS tokenizer from
simpletoicuso new inverted indexes handle mixed-language text without requiring users to opt into multilingual tokenization. Legacy missing tokenizer metadata continues to resolve tosimple, and builds without the ICU feature still fall back tosimple.Benchmark summary from the 100M-row runs:
The key tradeoff is that ICU has modest overhead for English-only data, but it fixes default recall for unspaced CJK, Japanese, and Thai text.
Detailed benchmark numbers
English-only 100M rows
Mixed-language 100M rows
simplemisses